Structured prediction or structured (output) learning is an

umbrella term In linguistics, semantics, general semantics, and ontologies, hyponymy () is a semantic relation between a hyponym denoting a subtype and a hypernym or hyperonym (sometimes called umbrella term or blanket term) denoting a supertype. In other wor ...

for supervised machine learning techniques that involves predicting structured objects, rather than scalar

discrete Discrete may refer to: *Discrete particle or quantum in physics, for example in quantum theory * Discrete device, an electronic component with just one circuit element, either passive or active, other than an integrated circuit *Discrete group, a ...

real Real may refer to: Currencies * Brazilian real (R$) * Central American Republic real * Mexican real * Portuguese real * Spanish real * Spanish colonial real Music Albums * ''Real'' (L'Arc-en-Ciel album) (2000) * ''Real'' (Bright album) (2010) ...

values. Similar to commonly used supervised learning techniques, structured prediction models are typically trained by means of observed data in which the true prediction value is used to adjust model parameters. Due to the complexity of the model and the interrelations of predicted variables the process of prediction using a trained model and of training itself is often computationally infeasible and

approximate inference Approximate inference methods make it possible to learn realistic models from big data Though used sometimes loosely partly because of a lack of formal definition, the interpretation that seems to best describe Big data is the one associated w ...

and learning methods are used.

Applications

For example, the problem of translating a

natural language In neuropsychology, linguistics, and philosophy of language, a natural language or ordinary language is any language that has evolved naturally in humans through use and repetition without conscious planning or premeditation. Natural languages ...

sentence into a syntactic representation such as a

parse tree A parse tree or parsing tree or derivation tree or concrete syntax tree is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. The term ''parse tree'' itself is used primarily in co ...

can be seen as a structured prediction problem in which the structured output domain is the set of all possible parse trees. Structured prediction is also used in a wide variety of application domains including

bioinformatics Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combi ...

natural language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to pro ...

speech recognition Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers with the m ...

, and

computer vision Computer vision is an interdisciplinary scientific field that deals with how computers can gain high-level understanding from digital images or videos. From the perspective of engineering, it seeks to understand and automate tasks that the hum ...

Example: sequence tagging

Sequence tagging is a class of problems prevalent in

, where input data are often sequences (e.g. sentences of text). The sequence tagging problem appears in several guises, e.g.

part-of-speech tagging In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definitio ...

and

named entity recognition Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre ...

. In POS tagging, for example, each word in a sequence must receive a "tag" (class label) that expresses its "type" of word: : The main challenge of this problem is to resolve

ambiguity Ambiguity is the type of meaning in which a phrase, statement or resolution is not explicitly defined, making several interpretations plausible. A common aspect of ambiguity is uncertainty. It is thus an attribute of any idea or statement ...

: the word "sentence" can also be a verb in English, and so can "tagged". While this problem can be solved by simply performing

classification Classification is a process related to categorization, the process in which ideas and objects are recognized, differentiated and understood. Classification is the grouping of related facts into classes. It may also refer to: Business, organizat ...

of individual tokens, that approach does not take into account the empirical fact that tags do not occur independently; instead, each tag displays a strong

conditional dependence In probability theory, conditional dependence is a relationship between two or more events that are dependent when a third event occurs.Introduction to Artificial Intelligence by Sebastian Thrun and Peter Norvig, 201"Unit 3: Conditional Dependenc ...

on the tag of the previous word. This fact can be exploited in a sequence model such as a

hidden Markov model A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process — call it X — with unobservable ("''hidden''") states. As part of the definition, HMM requires that there be an ob ...

conditional random field Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without consid ...

that predicts the entire tag sequence for a sentence, rather than just individual tags, by means of the

Viterbi algorithm The Viterbi algorithm is a dynamic programming algorithm for obtaining the maximum a posteriori probability estimate of the most likely sequence of hidden states—called the Viterbi path—that results in a sequence of observed events, especiall ...

Techniques

Probabilistic

graphical model A graphical model or probabilistic graphical model (PGM) or structured probabilistic model is a probabilistic model for which a Graph (discrete mathematics), graph expresses the conditional dependence structure between random variables. They are ...

s form a large class of structured prediction models. In particular,

Bayesian network A Bayesian network (also known as a Bayes network, Bayes net, belief network, or decision network) is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). Bay ...

s and

random field In physics and mathematics, a random field is a random function over an arbitrary domain (usually a multi-dimensional space such as \mathbb^n). That is, it is a function f(x) that takes on a random value at each point x \in \mathbb^n(or some other d ...

s are popular. Other algorithms and models for structured prediction include inductive logic programming,

case-based reasoning In artificial intelligence and philosophy, case-based reasoning (CBR), broadly construed, is the process of solving new problems based on the solutions of similar past problems. In everyday life, an auto mechanic who fixes an engine by recalli ...

structured SVM The structured support-vector machine is a machine learning algorithm that generalizes the Support-Vector Machine (SVM) classifier. Whereas the SVM classifier supports binary classification, multiclass classification and regression, the structure ...

Markov logic network A Markov logic network (MLN) is a probabilistic logic which applies the ideas of a Markov network to first-order logic, enabling uncertain inference. Markov logic networks generalize first-order logic, in the sense that, in a certain limit, all uns ...

Probabilistic Soft Logic Probabilistic Soft Logic (PSL) is a statistical relational learning (SRL) framework for modeling probabilistic and relational domains. It is applicable to a variety of machine learning problems, such as collective classification, entity reso ...

, and

constrained conditional model A constrained conditional model (CCM) is a machine learning and inference framework that augments the learning of conditional (probabilistic or discriminative) models with declarative constraints. The constraint can be used as a way to incorporate e ...

s. Main techniques: *

Conditional random field Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without consid ...

Structured support vector machine The structured support-vector machine is a machine learning algorithm that generalizes the Support-Vector Machine (SVM) classifier. Whereas the SVM classifier supports binary classification, multiclass classification and regression, the structure ...

* Structured k-Nearest Neighbours *

Recurrent neural network A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes can create a cycle, allowing output from some nodes to affect subsequent input to the same nodes. This allows it to exhibit temporal dynamic ...

, in particular Elman network

Structured perceptron

One of the easiest ways to understand algorithms for general structured prediction is the structured perceptron of

Collins Collins may refer to: People Surname Given name * Collins O. Bright (1917–?), Sierra Leonean diplomat * Collins Chabane (1960–2015), South African Minister of Public Service and Administration * Collins Cheboi (born 1987), Kenyan middle- ...

. This algorithm combines the

perceptron In machine learning, the perceptron (or McCulloch-Pitts neuron) is an algorithm for supervised learning of binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belon ...

algorithm for learning

linear classifier In the field of machine learning, the goal of statistical classification is to use an object's characteristics to identify which class (or group) it belongs to. A linear classifier achieves this by making a classification decision based on the val ...

s with an inference algorithm (classically the

when used on sequence data) and can be described abstractly as follows. First define a "joint feature function" Φ(x, y) that maps a training sample x and a candidate prediction y to a vector of length ''n'' (x and y may have any structure; ''n'' is problem-dependent, but must be fixed for each model). Let GEN be a function that generates candidate predictions. Then: :Let

w

be a weight vector of length ''n'' :For a pre-determined number of iterations: ::For each sample

x

in the training set with true output

t

: :::Make a prediction

\hat=\, \\,(^\, \phi(, ))

:::Update

w

, from

\hat

t

=+(-\phi(, \hat)+ \phi(, ))

c

learning rate In machine learning and statistics, the learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function. Since it influences to what extent newly ac ...

In practice, finding the argmax over

({x})

will be done using an algorithm such as Viterbi or an algorithm such as max-sum, rather than an

exhaustive search In computer science, brute-force search or exhaustive search, also known as generate and test, is a very general problem-solving technique and algorithmic paradigm that consists of systematically enumerating all possible candidates for the soluti ...

through an exponentially large set of candidates. The idea of learning is similar to multiclass perceptron.

References

* Noah Smith
Linguistic Structure Prediction
2011. * Michael Collins
Discriminative Training Methods for Hidden Markov Models
2002.

External links

Implementation of Collins structured perceptron